ABSTRACT



A system for automatically tracking objects through a video 
sequence uses a method which automatically tracks an 
object by estimating the location of the object in a subse- 
quent frame, comparing test windows in the subsequent 
frame with an object window encompassing the object in the 
original frame, and selecting the best match window which 
is most similar to the object window. The location of the best 
match window corresponds to the location of the object in 
each frame of the video sequence. The resulting location 
information can be combined with the video sequence to 
create interactive video applications in which a user of the
application can select individual objects in the interactive 
video sequence. 

4 Claims, 6 Drawing Sheets

VIDEO OBJECT TRACKING METHOD FOR
INTERACTIVE MULTIMEDIA 
APPLICATIONS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to the tracking of objects in the 
frames of a video sequence for interactive multimedia 
applications, interactive television, and games, for example. 

2. Description of the Related Art 

As computers have improved in processing power, dis- 
play capability, and storage capacity, computer applications 
incorporating video sequences have become commonplace. 
One important feature for the success of these "video 
applications" is for a user to interact with objects in the 
video sequence. 

For example, an educational program about marine life 
can incorporate a video sequence of marine life to more 
vividly display the various creatures of the ocean. Ideally, 
the program is interactive so that, when a child using a 
pointer on the screen selects a particular creature, the 
computer either states the scientific name and a short 
description of the creature or displays information about the 
creature on the screen. The computer must correlate the 
location of the pointer on the screen with the location of the 
creatures in the image to determine which creature was 
selected. If the video sequence is generated by the computer, 
then the computer should know the location of each creature. 
However, for realism an actual video sequence of real ocean 
creatures should be digitized and displayed by the computer. 
In this case, the computer is only given a series of images 
and will not know the location of each creature within the 
image. Consequently, a separate data structure must be 
created containing the location of each creature in each 
frame of the video sequence in order for the computer to 
correlate the location of the pointer to the location of the 
creatures to determine which creature was selected. In other 
programs, the location of other objects must be tracked 
much as the location of the ocean creatures is tracked.
Typically, the interactive video application will combine the 
original video sequence and the location information into an 
interactive video sequence on a computer readable medium, 
such as CD-ROMs, magnetic disks, or magnetic tapes.

The advances in computing have been transplanted into 
the video industry in the form of microprocessor controlled
"set-top boxes" for applications such as interactive television
with video on demand systems. In an interactive
television system, it is desirable to allow the user to select 
objects on the television screen and receive information 
regarding that object. For example, a customer may request 
a video about Pro Football's greatest games which will 
contain various video sequences of significant games. A 
customer will derive greater enjoyment from the video if he 
is able to select players using a pointer on the screen to 
receive additional information on the screen about the 
selected player. For example, the customer may be interested 
in the player's game statistics or career statistics. In order to 
determine which player has been selected, the microproces- 
sor must be able to correlate the location of the pointer to the 
location of the players. Therefore, each player image must 
be tracked as a separate object throughout the video 
sequence. Furthermore, the microprocessor must receive the 
object data as well as the desired information about the 
players at the same time the video is sent to the set-top box. 
The interactive video sequence combining the original video 
and the location information can be sent to the set-top box
using various transmission lines, such as coaxial cable,
phone lines, or fiber-optic cabling. Alternatively, the interactive
video sequence can be broadcast to the set-top boxes
over the airwaves.

Unless the processor knows the location of all objects in
the video sequence, the processor will be unable to discern
which object the user wishes to select. Consequently, in
order to provide interactive television or interactive video
applications, the location of each object in each frame of the
original video sequence must be tracked for the processor.
Typically, object locations are generated by having an
operator manually mark objects on a computer display for
each frame of the video sequence. For the marine life video
the operator would first identify a particular creature to be
tracked. The operator would then use a mouse or other user
interface hardware to draw a rectangle around the creature
in the current frame. The computer then stores the
information on that creature for that frame. Once the operator
finishes marking the frame, he must proceed to the next
frame. This continues for each frame of the video sequence
and for each creature in the video sequence. Alternatively,
the operator can mark and label every creature on each frame
before proceeding to the next frame.

Full motion video currently uses thirty frames per second,
so even one minute of video generates 1800 frames.
Therefore it is extremely inefficient and tedious to
track the objects manually. Since the objects will probably
not move far between frames, some conventional systems
attempt to estimate the motion of tracked objects using
interpolation. For example, an operator may manually track
an object once for every second of the video sequence. The
system then estimates the location of the object in the frames
between the manually tracked frames by linear interpolation
using the position of the object in the manually tracked
frames. This method can produce reliable results for objects
which exhibit linear motion; however, for nonlinearly
moving objects this method is very inaccurate.
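
The conventional interpolation approach described above can be sketched as follows. This is only an illustrative sketch; the function name `interpolate_centers` and the representation of a manually tracked frame as a `(frame_number, (x, y))` pair are assumptions made for this example and do not come from any particular conventional system.

```python
def interpolate_centers(key_a, key_b):
    """Linearly interpolate object centers between two manually tracked frames.

    key_a and key_b are (frame_number, (x, y)) pairs, with key_a the earlier one.
    Returns a dict mapping each in-between frame number to an (x, y) center.
    """
    (fa, (xa, ya)), (fb, (xb, yb)) = key_a, key_b
    if fb <= fa:
        raise ValueError("key_b must come after key_a")
    centers = {}
    for f in range(fa + 1, fb):
        t = (f - fa) / (fb - fa)  # fraction of the way from key_a to key_b
        centers[f] = (xa + t * (xb - xa), ya + t * (yb - ya))
    return centers


# One manually tracked frame per second at thirty frames per second.
between = interpolate_centers((0, (0.0, 0.0)), (30, (30.0, 60.0)))
```

An object that curves or accelerates between the two manually tracked frames will deviate from these straight-line positions, which is the inaccuracy noted above.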

Hence there is a need for a method or system to automatically
track an object through a video sequence more rapidly,
more accurately, and more efficiently than conventional methods.

SUMMARY OF THE INVENTION 
In accordance with this invention, the objects within a
video sequence are tracked throughout the video sequence.
Instead of requiring a user to manually track the objects
through each frame of the video sequence, the present
invention will automatically track any specified object in the
video sequence.

Specifically, in one embodiment the system accepts a
specified object window enveloping the object to be tracked
in a first frame. The system then estimates the position of the
object in a second frame based on the position of the object
in a selected number of previous frames. Next, all possible
test windows, which may contain the object, within a
predetermined distance from the estimated position are
compared with the object window. The system then selects
the best match window, i.e. the window most similar to the
object window, from the test windows.

The automatic tracking method can use various measures
to determine which test window is most similar to the object
window. For example, in one embodiment the method may
incorporate edge detection with chamfer distance
calculations, in which the test window with the lowest
chamfer distance is the best match window. In another
embodiment, intensity distance is used. Tracking quality can
be controlled by setting thresholds for the similarity measures.

The system can be enhanced through the use of a graphical
user interface (GUI). The user may initialize the image
object of interest by drawing an object window that envelops
the object. A monitor window is provided so that the
user can monitor the tracking progress. The user may stop or
restart an object tracking process if the user desires.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a computer system which can be used to
implement an object tracking system in accordance with the
present invention.

FIG. 2 shows a graphical user interface which can be used
with an object tracking system in accordance with the
present invention.

FIG. 3 shows a process flow diagram of an implementation
of the object tracking system in accordance with the
present invention using intensity distance.

FIG. 4 shows a process flow diagram of an implementation
of the object tracking system in accordance with the
present invention using chamfer distance.

FIG. 5 shows a process flow diagram of an implementation
of the object tracking system in accordance with the
present invention using intensity distance and object window
replacement.

FIG. 6 shows a process flow diagram of an implementation
of the object tracking system in accordance with the
present invention using chamfer distance and object window
replacement.

DETAILED DESCRIPTION

According to the principles of this invention, the inefficiencies
imposed by manual tracking of objects have been
overcome. The present invention automatically tracks one or
more objects through a video sequence. For example, in one
embodiment of the invention, after the user specifies an
object in one frame of the video, the invention will automatically
track the object in subsequent frames until the
tracking quality falls below a predetermined threshold.
Therefore, the present invention eliminates the need for a
user to manually track an object through every frame of the
video sequence.

FIG. 1 shows computer system 100 which can implement
the invention. Computer system 100 has processor 110
coupled to memory 120, user interface hardware 130, and
display screen 140. User interface hardware 130 is typically
a keyboard and a mouse. However, other user interface
hardware such as joysticks, trackballs, light pens, or any
other appropriate interface mechanism can also be used.
Display screen 140 provides a graphical user interface (GUI)
for the user. Computer system 100 can be implemented
using various computer systems such as workstations or
IBM compatible personal computers.

FIG. 2 is an example of GUI 200 which can be used with
the present invention. Frames of the video sequence are
displayed in image window 210. Information regarding the
specific frame in image window 210 is displayed in information
panel 220. For example, in the particular embodiment
of FIG. 2, information panel 220 displays the frame
number of the frame in image window 210, the template
number used for matching, the length of time of the video
sequence already processed, as well as the confidence level
of the current match. Command panel 230 provides various
control buttons to control the system. Menu bar 240 provides
additional commands for the system. The user can select the
various commands using pointer 250 which is controlled by
user interface hardware 130.

A typical object tracking session begins when the user
opens a video file. The user then selects Play button 231 to
watch the video sequence. Once the user sees an object he
or she wishes to track, the user selects Pause button 232 to
pause the video sequence. The user then selects the Initialize
button to begin the tracking process. The user then uses
pointer 250 to define the object window around the object so
that the object window envelops the object. The object
window can be any arbitrary shape; however, a rectangular
window is typically used. The user then selects the Track button,
which instructs the system to automatically track the
object in subsequent frames. The system will display each
frame of the video sequence in image window 210 as it is
processed so that the user can interrupt the process or
re-initialize the process if he or she desires. If the system fails to
track the object in a frame, the system will warn the user and
wait for further instructions from the user. Alternatively, a
user can set a low threshold to allow the
system to automatically track the object while the user
monitors image window 210. If the low
threshold tracking becomes inaccurate, the user can then
manually interrupt the automatic tracking to reinitialize the
tracking at the point the automatic tracking became inaccurate
by respecifying the object window.

Depending on the particular embodiment of the invention
used, other warning messages may also be displayed. For
example, in one embodiment the system can replace the
original object window with a replacement window if certain
conditions (to be described in detail below) are met.
Furthermore, the system can track multiple objects simultaneously
if the user selects multiple objects during the
initialization phase of the process.

FIG. 3 shows a process flow diagram for automatically
tracking objects in a video sequence. In object window
specification 310, the area around the object to be tracked is
specified as an object window. The object window can be
any arbitrary shape. An object window in the same shape as
the object is likely to lead to more accurate tracking;
however, a rectangular window can be processed more
rapidly without significantly reducing accuracy. A compromise
can be achieved between a rectangular window and a
window shaped like the object by using a window in the
shape of a polygon or a circle which is similar to the shape
of the object. In accordance with this invention, a window of
any appropriate shape can be used. In some embodiments of
the invention, the initial object window is specified by the
system. For example, for the marine video, the system can
be programmed to find a particular type of marine creature
for the initial object window based on previous images of
that creature. After specification of the object window is
complete, processing is transferred to object window intensity
image generation 320.

Object window intensity image generation 320 generates
an intensity image of the object window. For grayscale
images the intensity image is simply the grayscale value of
each picture element (pixel) of the window. Color images
can be converted to grayscale images to derive the intensity
image. However, better results are obtained by using an
intensity image for each color component of the object
window. Therefore, the system will generate a red intensity
image, a green intensity image, and a blue intensity image
for a color image. The exact method used to convert the
video sequence into intensity images is conventional and
depends on the format used for creation and storage of the
video sequence.

Processing then transfers to next frame intensity image
generation 330, where the next frame of the video sequence
is converted into an intensity image.
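
The two ways of deriving intensity images from a color window can be illustrated with the short sketch below. It is only a sketch under stated assumptions: the image is assumed to be a list of rows of `(r, g, b)` tuples, the function names are hypothetical, and the unweighted channel mean used for the grayscale conversion is just one conventional choice, since the exact conversion depends on the video format.

```python
def channel_intensity_images(rgb_image):
    """Split an image (rows of (r, g, b) tuples) into red, green, and blue intensity images."""
    red = [[pixel[0] for pixel in row] for row in rgb_image]
    green = [[pixel[1] for pixel in row] for row in rgb_image]
    blue = [[pixel[2] for pixel in row] for row in rgb_image]
    return red, green, blue


def grayscale_intensity_image(rgb_image):
    """Derive a single intensity image from a color image.

    Assumption: an unweighted mean of the three components; other
    conventional conversions weight the components differently.
    """
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in rgb_image]


window = [[(10, 20, 30), (0, 0, 0)]]
red, green, blue = channel_intensity_images(window)
gray = grayscale_intensity_image(window)
```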
Then, in next frame position estimation 340, the system
estimates the location of the object in the next frame based
on the center of the object in the previous frames. The
number of previous frames used in the estimation will
depend on the particular estimation method used. In one
embodiment of the invention, an estimation method using a
least square fit of a quadratic polynomial is used to estimate
the location of the object in the next frame. Specifically, this
embodiment attempts to estimate $C_{N+1}$, the center of the
object in frame number N+1, where $C_{N+1}$ is a vector
$[X_{N+1}, Y_{N+1}]$, using the equation:

$$C_{N+1} = i\,(N+1)^2 + j\,(N+1) + k \qquad (1)$$

where i, j, and k are also two dimensional vectors to be
determined using equation (4). The horizontal coordinate
and the vertical coordinate of the center are calculated
independently. Therefore, equation (1) can be thought of as
two independent equations:

$$X_{N+1} = i_x\,(N+1)^2 + j_x\,(N+1) + k_x \qquad (2)$$

and

$$Y_{N+1} = i_y\,(N+1)^2 + j_y\,(N+1) + k_y \qquad (3)$$

where the subscripts x and y denote the horizontal and vertical
components of i, j, and k; however, the two-dimensional form
in equation (1) will be used for conciseness.

The values of i, j, and k are estimated by finding the i, j,
and k which minimize the following expression:

$$\sum_{d=1}^{N} \left\| C_d - \left( i\,d^2 + j\,d + k \right) \right\|^2 \qquad (4)$$

where $C_d$ is the center of the object window of frame number
d. Well known methods are used to solve equation (4). When
N is less than three, there are not enough data points to solve
equation (4) accurately. Therefore, when N is less than three
a linear estimate is used instead of the quadratic polynomial.
Specifically, the value of i is set to zero and the values of j
and k are estimated. After estimating the center of the object
in the next frame, processing passes to test window
determination 350.
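
A minimal sketch of this estimation step is given below, assuming the previous centers are supplied in order for frames d = 1 through N. The function names are hypothetical, the normal-equation solver is only one of the well known methods that could be used, and returning the single known center when N is one is an assumption added here because the text does not address that case.

```python
def _solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def _fit_and_predict(values, degree):
    """Least-squares polynomial fit of values at d = 1..N, evaluated at d = N + 1."""
    n = len(values)
    frames = range(1, n + 1)
    powers = range(degree + 1)
    # Normal equations for minimizing expression (4), one coordinate at a time.
    a = [[sum(d ** (p + q) for d in frames) for q in powers] for p in powers]
    b = [sum(d ** p * v for d, v in zip(frames, values)) for p in powers]
    coef = _solve(a, b)  # coef[0] is k, coef[1] is j, coef[2] (if present) is i
    return sum(c * (n + 1) ** p for p, c in enumerate(coef))


def estimate_next_center(centers):
    """Estimate the center in frame N + 1 from the centers of frames 1..N."""
    n = len(centers)
    if n == 0:
        raise ValueError("at least one previous center is required")
    if n == 1:
        return centers[0]  # assumption: with one data point, reuse it
    degree = 2 if n >= 3 else 1  # linear estimate (i = 0) when N < 3
    xs = [c[0] for c in centers]
    ys = [c[1] for c in centers]
    return (_fit_and_predict(xs, degree), _fit_and_predict(ys, degree))
```

The horizontal and vertical coordinates are fitted independently, mirroring equations (2) and (3).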

In test window determination 350, the system will create 
a set of test windows in the next frame. The set of test 
windows is composed of every possible window of the same 
size and shape as the object window having a center pixel 
less than a predefined distance from the estimated center 
point. Once the set of test windows is determined, process- 
ing is transferred to intensity distance calculation 360. 
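
The enumeration of test windows can be sketched as follows for a rectangular object window. The function name, the use of Euclidean distance for the "predefined distance", and the decision to discard windows that would extend past the frame boundary are assumptions made for this illustration.

```python
import math


def candidate_window_centers(estimated_center, max_distance, frame_size, window_size):
    """List the center pixels of all test windows near the estimated center.

    A center qualifies when its distance from estimated_center is strictly
    less than max_distance and the window lies fully inside the frame.
    """
    ex, ey = estimated_center
    frame_w, frame_h = frame_size
    win_w, win_h = window_size
    centers = []
    for cy in range(math.floor(ey - max_distance), math.ceil(ey + max_distance) + 1):
        for cx in range(math.floor(ex - max_distance), math.ceil(ex + max_distance) + 1):
            if math.hypot(cx - ex, cy - ey) >= max_distance:
                continue  # not within the predefined distance
            left, top = cx - win_w // 2, cy - win_h // 2
            if left < 0 or top < 0 or left + win_w > frame_w or top + win_h > frame_h:
                continue  # assumption: skip windows that leave the frame
            centers.append((cx, cy))
    return centers
```

Because every qualifying center yields one test window, the number of comparisons grows roughly with the square of the predefined distance, which is why a good position estimate keeps the search small.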

Within intensity distance calculation 360, the intensity
distance between the object window and each test window
is calculated. Specifically, for a specific test window, T, the
system generates an intensity distance window, $ID_T(x,y)$, of
the same size and shape as the object window and test
windows, by subtracting the intensity of each pixel in the
test window, T, from the intensity of the corresponding pixel
in the object window. Then the system calculates an average
intensity distance, $A_T$, for test window T by averaging the
distance value of every distance in the intensity distance
window, $ID_T(x,y)$. Finally the system calculates the intensity
distance $ID_T$ for test window T by summing the absolute
values of the difference between each distance in distance
window, $ID_T(x,y)$, and the average distance $A_T$. Calculation
of the intensity distance is summarized by the following
equations:

$$ID_T(x,y) = OW(x,y) - T(x,y) \qquad (5)$$

$$A_T = \frac{1}{M} \sum ID_T(x,y) \qquad (6)$$

$$ID_T = \sum \left| ID_T(x,y) - A_T \right| \qquad (7)$$

where $OW(x,y)$ is the intensity image of the object window,
$T(x,y)$ is the intensity image of test window T,
M is the number of pixels in the object window, and the
summations are across all M pixels.
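
Equations (5) through (7) can be illustrated with the following sketch, which assumes each intensity image is a list of rows of numeric pixel values and that both windows have the same size. The function name is hypothetical.

```python
def intensity_distance(object_window, test_window):
    """Intensity distance between two equally sized intensity images."""
    # Equation (5): per-pixel difference, object window minus test window.
    diffs = [
        o - t
        for o_row, t_row in zip(object_window, test_window)
        for o, t in zip(o_row, t_row)
    ]
    # Equation (6): average difference over all M pixels.
    average = sum(diffs) / len(diffs)
    # Equation (7): sum of absolute deviations from the average difference.
    return sum(abs(d - average) for d in diffs)


object_window = [[10, 20], [30, 40]]
brighter_copy = [[15, 25], [35, 45]]
distance = intensity_distance(object_window, brighter_copy)  # 0.0
```

Because the average difference is subtracted before summing, a test window that differs from the object window only by a uniform brightness offset has a distance of zero, as in the usage lines above.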

As stated above, for color images better results are 
obtained by calculating an intensity image for each color 
component. To determine the intensity distance between a 
color object window intensity image and a color test 
window, an intensity distance for each color component is 
calculated. The intensity distance of the specific test window 
is then set to the maximum of the intensity distances of the 
color components. 
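
For color images, the per-component rule can be sketched as below. The sketch assumes each window is supplied as a sequence of per-component intensity images (for example red, green, blue), each a list of rows of numbers; the function names are hypothetical.

```python
def _component_distance(object_image, test_image):
    """Intensity distance for one color component (equations (5) to (7))."""
    diffs = [
        o - t
        for o_row, t_row in zip(object_image, test_image)
        for o, t in zip(o_row, t_row)
    ]
    average = sum(diffs) / len(diffs)
    return sum(abs(d - average) for d in diffs)


def color_intensity_distance(object_components, test_components):
    """Intensity distance of a color test window: the maximum over its components."""
    return max(
        _component_distance(o, t)
        for o, t in zip(object_components, test_components)
    )
```

Taking the maximum means a test window is judged by its worst-matching color component, so a good match in two components cannot hide a poor match in the third.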

After the intensity distance for each test window is 
calculated, the system selects the test window with the 
lowest intensity distance as the best match window in best 
match selection 370. Processing then passes to match thresh- 
old comparison 380. 

In match threshold comparison 380, the intensity distance 
of the best match window is compared to a predetermined 
match threshold. If the intensity distance of the best match 
window is less than the predetermined threshold, the track- 
ing is acceptable and processing is passed to last frame 
check 384. If the last frame is reached the system ends and 
awaits further commands from the user. Otherwise, the 
system advances to the next frame in frame advance 386. 
Since most objects will not travel very far in each frame, the 
system can obtain reliable results even if only a subset of the 
frames of the video sequence is processed. Therefore, some 
embodiments of the invention will advance several frames in 
frame advance 386. Then processing passes to next frame 
intensity image generation 330 and the next frame of the 
video sequence will be processed in the manner just 
described. However, if the intensity distance of the best 
match window is greater than the match threshold, process- 
ing passes to display warning 390, where the system dis- 
plays a warning on GUI 200 (FIG. 2) and awaits further 
commands from the user. 
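
Best match selection 370 and match threshold comparison 380 can be sketched as follows. The function names and the representation of the candidates as a dict from window center to distance are assumptions for this example. The text does not say what happens when the distance exactly equals the threshold; this sketch treats that case as a failed match.

```python
def select_best_match(distances):
    """Return the (window_center, distance) pair with the lowest distance.

    distances maps each test window center to its distance from the object window.
    """
    return min(distances.items(), key=lambda item: item[1])


def match_is_acceptable(best_distance, match_threshold):
    """True when tracking may continue; False when a warning should be displayed."""
    # Assumption: a distance equal to the threshold is not accepted.
    return best_distance < match_threshold


distances = {(10, 10): 42.0, (11, 10): 7.5, (10, 11): 19.0}
center, best = select_best_match(distances)
keep_tracking = match_is_acceptable(best, match_threshold=20.0)
```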

FIG. 4 shows a process flow diagram of a second embodi- 
ment 400 of the invention. This embodiment is similar to 
that illustrated in FIG. 3. Consequently, the complete 
description is not repeated. Rather, only the modifications to 
the structures and operations in FIG. 3 are described. In the 
second embodiment the chamfer distance between edge 
images is used in place of the intensity distance between
intensity images. A chamfer distance between two edge 
images is calculated by averaging the distance of each edge 
pixel on the first edge image to the closest edge pixel in the 
second image. Specifically, in FIG. 4, object window inten- 
sity image generation 320 is replaced with object window 
edge image generation 420; next frame intensity image 
generation 330 is replaced with next frame edge image 
generation 430; and intensity distance calculation 360 is 
replaced with chamfer distance calculation 460. 

In object window edge image generation 420, an edge 
detection method is used to create an edge image of the 
object window. The edge detection method extracts the 
skeletal structure of the object by finding the edges of the 
object in the image. Any reasonably good edge detection 
method can be used with the present invention. A specific 
method of edge detection suitable for use in connection with 
the present invention is shown on pages 1355-1357 of Hu,
et al., "Feature Extraction and Matching as Signal
Detection", International Journal of Pattern Recognition
and Artificial Intelligence, Vol. 8, No. 6, 1994, pp.
1343-1379, the disclosure of which is hereby incorporated
by reference.

In next frame edge image generation 430, an edge detection
method is used to create an edge image of the next frame
of the video sequence. As with object window edge image 
generation 420, any reasonable edge detection method can 
be used. However, for more consistent results object window 
edge image generation 420 and next frame edge image 
generation 430 should use the same edge detection method. 

In chamfer distance calculation 460, the system calculates 
the chamfer distance between the object window edge image 
and each test window, as determined in test window deter- 
mination 350. A chamfer distance between two edge images 
is calculated by averaging the distance of each edge pixel on 
the first edge image to the closest edge pixel in the second 
image. Any chamfer distance calculation method can be 
used with the present invention. A specific method of 
chamfer distance calculation suitable for use in connection 
with the present invention is shown on pages 1360-1362 of 
the above-described article by Hu, et al.
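
The definition of chamfer distance given above can be illustrated with a brute-force sketch. It is not the method of Hu, et al.; it simply averages, over the edge pixels of the first image, the Euclidean distance to the nearest edge pixel of the second. Representing an edge image as a list of `(x, y)` edge pixel coordinates, and the function name, are assumptions for this example.

```python
import math


def chamfer_distance(edges_a, edges_b):
    """Average distance from each edge pixel in edges_a to the closest edge pixel in edges_b."""
    if not edges_a or not edges_b:
        raise ValueError("both edge images must contain at least one edge pixel")
    total = sum(
        min(math.hypot(ax - bx, ay - by) for bx, by in edges_b)
        for ax, ay in edges_a
    )
    return total / len(edges_a)


object_edges = [(0, 0), (1, 0)]
test_edges = [(0, 0), (1, 3)]
distance = chamfer_distance(object_edges, test_edges)  # 0.5
```

Note that the measure as defined is not symmetric: swapping the two edge images in the usage lines gives 1.5 rather than 0.5, so the order of the arguments should be kept consistent across all test windows.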

Best match selection 370 and match threshold comparison 
380 are functionally identical in the embodiments of FIG. 3 
and FIG. 4. However, in the embodiment of FIG. 3, the 
distances and match threshold are intensity distances while 
in the embodiment of FIG. 4, the distances and match 
threshold are chamfer distances. 

FIG. 5 shows a process flow diagram of another embodi- 
ment of the present invention. This embodiment is similar to 
that illustrated in FIG. 3. Consequently, the complete 
description is not repeated. Rather, only the modifications 
shown in FIG. 5 to the structures and operations in FIG. 3 
are described. The embodiment of FIG. 5 improves the 
embodiment of FIG. 3 by allowing the system to automati- 
cally replace the object window with a best match window 
from a previous frame. The embodiment of FIG. 5 functions 
identically with the embodiment of FIG. 3 until after match 
threshold comparison 380. If the best match window is 
acceptable, processing passes to replacement object window 
threshold comparison 595, where the distance of the best 
match window is compared to a replacement object window 
threshold. If the distance of the best match window is greater 
than the replacement object window threshold, the best 
match window is an acceptable replacement for the object 
window. Processing then passes to store replacement object 
window 596, where the best match window is stored as a 
replacement object window in a first-in-last-out arrangement.
The maximum number of acceptable replacement
windows retained in the first-in-last-out arrangement is 
preferably user selectable. Preferably, when the maximum 
number of acceptable replacement windows is reached the 
oldest acceptable replacement window is eliminated. The 
system then continues to last frame check 384. If the 
distance of the best match window is less than the replace- 
ment object threshold, the best match window is not an 
acceptable replacement for the object window and process- 
ing passes directly to last frame check 384. 

If the best match window is found to be unacceptable in 
match threshold comparison 380, processing passes to dis- 
play first warning 591, where the system displays a warning 
message that the system is attempting to replace the object 
window. In replacement object window available 592, the 
system checks to see if any acceptable replacements for the 
object window have been stored in the first-in-last-out
arrangement. If a replacement window is available, the 
system will replace the object window with the most 
recently stored acceptable replacement window and remove 
that replacement window from the first-in-last-out arrangement.
Processing is then returned to intensity distance calculation
360 with the replaced object window.

If no acceptable replacement window is available in the
first-in-last-out arrangement, the system will display a second
warning from display second warning 593. The system
then awaits further commands from the user.
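
The first-in-last-out arrangement of replacement object windows can be sketched with a bounded double-ended queue. The class and method names are hypothetical, and the window objects are treated as opaque values. The storage condition follows the description above: a best match window is stored when its distance is greater than the replacement object window threshold and less than the match threshold.

```python
from collections import deque


class ReplacementWindows:
    """Bounded first-in-last-out store of acceptable replacement object windows."""

    def __init__(self, max_windows):
        # With maxlen set, appending to a full deque discards the oldest entry.
        self._stack = deque(maxlen=max_windows)

    def offer(self, window, distance, match_threshold, replacement_threshold):
        """Store window if it is an acceptable replacement; return True if stored."""
        if replacement_threshold < distance < match_threshold:
            self._stack.append(window)
            return True
        return False

    def take(self):
        """Remove and return the most recently stored window, or None if empty."""
        return self._stack.pop() if self._stack else None


store = ReplacementWindows(max_windows=2)
store.offer("window A", distance=5, match_threshold=10, replacement_threshold=3)
replacement = store.take()  # "window A"
```

When `take` returns `None`, no acceptable replacement is available, which corresponds to the point at which the second warning is displayed.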

FIG. 6 shows a process flow diagram of another embodi- 
ment 600 of the invention. This embodiment is similar to 
that illustrated in FIG. 5. Consequently, the complete 
description is not repeated. Rather, only the modifications in 
FIG. 6 to the structures and operations in FIG. 5 are 
described. In this embodiment the chamfer distance between 
edge images is used in place of the intensity distance 
between intensity images. Specifically, object window inten- 
sity image generation 320 is replaced with object window 
edge image generation 420; next frame intensity image 
generation 330 is replaced with next frame edge image 
generation 430; and intensity distance calculation 360 is 
replaced with chamfer distance calculation 460. The conse- 
quence of these changes is described above with regard to 
FIG. 4. 

The various embodiments of the structure and method of 
this invention that are described above are illustrative only 
of the principles of this invention and are not intended to 
limit the scope of the invention to the particular embodiments
described. In view of this disclosure, those skilled
in the art can define other implementations of edge
detection, chamfer matching, motion estimation, distance 
measures, GUI, matching criteria, replacement criteria, and 
use these alternative features to create a method or system of 
object tracking according to the principles of this invention. 

We claim: 

1. An object tracking method for automatically tracking 
an object in a video sequence, the object being chosen by a
user, comprising:

specifying an object window enveloping said object in a 
first frame of said video sequence; 

estimating a position of said object in a second frame of 
said video sequence; 

determining a plurality of test windows in said second 
frame within a predetermined distance of said esti- 
mated position of said object; 

comparing said object window with each of said test 
windows, comprising 

calculating a distance between said object window and 

each of said test windows; and 
selecting a best match window from said plurality of 

test windows, said best match window having a 

lowest distance between said object window and said 

test windows; 

comparing said lowest distance to a predetermined match 
threshold; 

displaying a warning if said lowest distance is greater than 

said predetermined match threshold; 
comparing said lowest distance to a predetermined 

replacement threshold; and 
specifying a replacement object window if said lowest 

distance is greater than said predetermined replacement 

threshold and said lowest distance is less than said 

predetermined match threshold. 

2. An object tracking method for automatically tracking 
an object in a video sequence, the object being chosen by a
user, comprising:

specifying an object window enveloping said object in a 
first frame; 

processing a plurality of sequential subsequent frames, 
each of said sequential subsequent frames having a set
of previous frames, by, for each of said sequential
subsequent frames, 
estimating a position of said object in each of said 
sequential subsequent frames by 
fitting the position of said object in each of said 

previous frames in a quadratic polynomial having 

coefficient vectors i, j, and k, 
approximating i, j, and k using a selected fitting 

method, and 

calculating said position of said object in said sequen- 
tial subsequent frame using i, j, and k in a quadratic 
polynomial; 

determining a plurality of test windows in each of said 
sequential subsequent frames within a predetermined 
distance of said estimated position of said object;

comparing said object window with each of said test 
windows in each of said sequential subsequent frames; 
and 

selecting a best match window from said plurality of test
windows for each of said sequential subsequent frames. 

3. An object tracking method as in claim 2, wherein said 
selected fitting method is a least squares fitting method. 

4. An object tracking method for automatically tracking 

an object in a video sequence, the object being chosen by a
user, comprising:

specifying an object window enveloping said object in a 
first frame; 

processing a plurality of sequential subsequent frames by, 
for each of said sequential subsequent frames; 



estimating a position of said object in each of said 

sequential subsequent frames; 
determining a plurality of test windows in each of said 

sequential subsequent frames within a predetermined 

distance of said estimated position of said object; 
comparing said object window with each of said test 

windows in each of said sequential subsequent frames, 

calculating a distance between said object window and 
each of said test windows, and 

selecting a best match window from said plurality of 
test windows for each of said sequential subsequent 
frames, said best match window having a lowest 
distance between said object window and said test 
windows; 

comparing said lowest distance to a predetermined match 
threshold; 

displaying a warning if said lowest distance is greater than 

said predetermined match threshold; 
comparing said lowest distance to a predetermined 

replacement threshold; and 
specifying a replacement object window if said lowest 

distance is greater than said predetermined replacement 

threshold and said lowest distance is less than said 

predetermined match threshold.